Timo Göbel, University of Konstanz, timo.goebel@uni-konstanz.de
Zdravko Monov, University of Konstanz, zdravko.monov@uni-konstanz.de
Toni Schmidt, University of Konstanz, toni.schmidt@uni-konstanz.de
Peter Bak, University of Konstanz, peter.bak@uni-konstanz.de
We created a Java application for formatting the data, computing frequencies
and plotting the resulting values. The LingPipe library [1] computed the
individual Levenshtein distances. Microsoft Excel was used for the
visualizations of MC 3.1, MC 3.2 and MC 3.3. For MC 3.4 we developed a
Processing [2] application that visualizes the results and lets the user sort
them interactively. Developing our prototypes took approximately 40 hours.
ANSWERS:
MC3.1: What is the region or
country of origin for the current outbreak? Please provide your answer as
the name of the native viral strain along with a brief explanation.
The task of mini challenge 3.1 was to identify the
region or country of origin for the current outbreak. Our initial hypothesis
was that the native strain of the country of origin differs less from the
outbreak sequences than the native strains of the other countries and regions.
We compared all outbreak sequences to the native sequences using the
Levenshtein distance (Figure 1).
Figure 1. Schematic description of the Levenshtein distance and its
computation.
We created a Java application that calculated
the Levenshtein distance using the LingPipe library [1]
and visualized the results in a bar chart. The implementation took
approximately 5 hours.
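For readers unfamiliar with the measure, the Levenshtein distance can be
computed by dynamic programming. The following is a minimal sketch with unit
costs for insertion, deletion and substitution; it is not the LingPipe
implementation used above, and the class name LevenshteinSketch is hypothetical.

```java
/** Minimal Levenshtein distance with unit costs (illustrative sketch, not LingPipe). */
final class LevenshteinSketch {

    /** Minimum number of insertions, deletions and substitutions turning a into b. */
    static int distance(CharSequence a, CharSequence b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning the empty prefix of a into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into the empty prefix of b takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                int substitution = prev[j - 1] + cost;
                int deletion = prev[j] + 1;
                int insertion = curr[j - 1] + 1;
                curr[j] = Math.min(substitution, Math.min(deletion, insertion));
            }
            int[] tmp = prev; // reuse the two rows instead of allocating a full matrix
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(distance("ACGTAC", "ACCTAC")); // sequences differing in one base
        System.out.println(distance("ACGT", "AGT"));      // sequences differing by one deletion
    }
}
```

Only two rows of the table are kept in memory, which is sufficient when only
the distance itself, and not the edit script, is needed.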
As shown in the following bar chart (Figure 2),
the country of origin is Nigeria_B, whose native strain requires the fewest
edit operations on average (15, stdev=1.1).
Figure 2. Levenshtein distance of all sequences of the current outbreak
to all native sequences, the regions are shown on the x-axis, and the average
number of edit operations on the y-axis.
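The selection step behind Figure 2 can be sketched as follows: given the
distances from each native strain to all outbreak sequences, compute the mean
per strain and pick the smallest. The strain names and distance values in the
example are made up, the class name is hypothetical, and the use of the
population standard deviation is an assumption, since the text does not say
which variant was reported.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Picks the native strain with the lowest mean distance (illustrative sketch). */
final class OriginRankingSketch {

    static double mean(int[] values) {
        double sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Population standard deviation (assumption: the variant is not stated in the text). */
    static double stdev(int[] values) {
        double m = mean(values);
        double squares = 0;
        for (int v : values) {
            squares += (v - m) * (v - m);
        }
        return Math.sqrt(squares / values.length);
    }

    /** Returns the strain whose distances to the outbreak sequences have the lowest mean. */
    static String closestStrain(Map<String, int[]> distancesByStrain) {
        String best = null;
        double bestMean = Double.POSITIVE_INFINITY;
        for (Map.Entry<String, int[]> e : distancesByStrain.entrySet()) {
            double m = mean(e.getValue());
            if (m < bestMean) {
                bestMean = m;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, int[]> distances = new LinkedHashMap<>();
        distances.put("StrainA", new int[] {15, 16, 14}); // made-up values
        distances.put("StrainB", new int[] {30, 31, 29}); // made-up values
        System.out.println(closestStrain(distances));
    }
}
```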
MC3.2: Over time, the virus
spreads and the diversity of the virus increases as it mutates. Two
patients infected with the Drafa virus are in the same hospital as
Nicolai. Nicolai has a strain identified by sequence 583. One
patient has a strain identified by sequence 123 and the other has a strain
identified by sequence 51. Assume only a single viral strain is in each
patient. Which patient likely contracted the illness from Nicolai and
why? Please provide your answer as the sequence number along with a brief
explanation.
The task of
mini challenge 3.2 was to identify which patient (#123 or #51) likely
contracted the illness from Nicolai and why. We determined the patient using
the Levenshtein distance: we expected a lower number of substitutions to
indicate fewer mutations and therefore a higher likelihood of direct
infection. The data was imported into our Java application, compared using the
Levenshtein distance and presented as a table in Excel (see Table 1), which
took approximately 4 hours.
Table 1. Comparison of Nicolai’s
viral strain to the strains of patients 123 and 51 by using the Levenshtein
distance. Fewer substitutions indicate higher likelihood of direct infection.
Patient
123 is more likely to have contracted the illness from Nicolai, since his
strain differs from Nicolai’s in only one substitution (A→C, 269). The strain
of Patient 51 differs from Nicolai’s in three substitutions (A→C, 494; C→T,
842; T→A, 946).
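When two sequences have the same length and differ only by substitutions, the
individual substitutions and their positions can be listed with a simple
position-by-position comparison. The sketch below assumes equal-length
sequences and 1-based positions counted from the left; the class name and the
short example sequences are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Lists base substitutions between two equal-length sequences (illustrative sketch). */
final class SubstitutionDiffSketch {

    /** Returns entries such as "G->C, 3" for every position where the bases differ. */
    static List<String> substitutions(String from, String to) {
        if (from.length() != to.length()) {
            throw new IllegalArgumentException("sequences must have equal length");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < from.length(); i++) {
            if (from.charAt(i) != to.charAt(i)) {
                // positions are reported 1-based, counted from the left (assumption)
                result.add(from.charAt(i) + "->" + to.charAt(i) + ", " + (i + 1));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(substitutions("ACGTAC", "ACCTAA"));
    }
}
```

The sequence with the shorter list is the closer one under the hypothesis
stated above.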
MC3.3: Signs and symptoms
of the Drafa virus are varied and humans react differently to infection.
Some mutant strains from the current outbreak have been reported as being worse
than others for the patients that come in contact with them.
Identify the top 3 mutations that
lead to an increase in symptom severity (a disease characteristic). The
mutations involve one or more base substitutions. For this question, the
biological properties of the underlying amino acid sequence patterns are not
significant in determining disease characteristics.
For each mutation provide the
base substitutions and their position in the sequence (left to right) where the
base substitutions occurred. For example,
C → G, 456 (C changed to G
at position 456)
G → A, 513 and T → A,
907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G
at position 39)
In mini challenge 3.3
we had to determine the mutations leading to an increase in symptom severity.
To that end, we applied a pairwise comparison of outbreak
sequences. More specifically, we compared all pairs that exhibit an increase in
symptom severity while all other characteristics remain unchanged. That way we
isolated the substitutions relevant only for symptom severity. We calculated
the relative frequency of these substitutions using a Java program, which took
approximately 2 hours. The results are shown in Figure 3, in which the top 3
mutations are highlighted. The substitutions at base positions 946 and 842 are
the top two mutations; those at positions 161 and 223 occur with equal
frequency and together form the third.
Figure 3. The top 20 mutations
that lead to an increase in symptom severity. The mutations occurring most
frequently across all sequences of the current outbreak are highlighted in red.
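The pairwise counting described for this task can be sketched as follows. The
sketch assumes that severity is encoded as an integer (higher is worse), that
the remaining characteristics are encoded as an integer array, and that all
sequences have the same length; the Strain class and all names are
hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Relative frequency of substitutions over severity-increasing pairs (illustrative sketch). */
final class SeverityPairsSketch {

    /** Hypothetical record of one outbreak sequence and its characteristics. */
    static final class Strain {
        final String sequence;
        final int severity;  // higher value means more severe symptoms (assumed encoding)
        final int[] others;  // all other disease characteristics (assumed encoding)

        Strain(String sequence, int severity, int... others) {
            this.sequence = sequence;
            this.severity = severity;
            this.others = others;
        }
    }

    /** Counts each substitution over all qualifying pairs and divides by the number of pairs. */
    static Map<String, Double> relativeFrequencies(List<Strain> strains) {
        Map<String, Integer> counts = new TreeMap<>();
        int pairs = 0;
        for (Strain a : strains) {
            for (Strain b : strains) {
                // keep the pair only if severity increases and nothing else changes
                if (b.severity > a.severity && Arrays.equals(a.others, b.others)) {
                    pairs++;
                    for (int i = 0; i < a.sequence.length(); i++) {
                        char x = a.sequence.charAt(i);
                        char y = b.sequence.charAt(i);
                        if (x != y) {
                            counts.merge(x + "->" + y + ", " + (i + 1), 1, Integer::sum);
                        }
                    }
                }
            }
        }
        Map<String, Double> frequencies = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            frequencies.put(e.getKey(), e.getValue() / (double) pairs);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        List<Strain> strains = Arrays.asList(
                new Strain("ACGT", 0, 1),
                new Strain("ACCT", 2, 1),
                new Strain("TCCT", 2, 1));
        System.out.println(relativeFrequencies(strains));
    }
}
```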
MC3.4: Due to the rapid
spread of the virus and limited resources, medical personnel would like to
focus on treatments and quarantine procedures for the worst of the mutant
strains from the current outbreak, not just symptoms as in the previous
question. To find the most dangerous viral mutants, experts are
monitoring multiple disease characteristics.
Consider each virulence and drug
resistance characteristic as equally important. Identify the top 3
mutations that lead to the most dangerous viral strains. The mutations involve
one or more base substitutions. In a worst case scenario, a very
dangerous strain could cause severe symptoms, have high mortality, cause major
complications, exhibit resistance to anti viral drugs, and target high risk
groups. For this question, the biological properties of the underlying
amino acid sequence patterns are not significant in determining disease
characteristics.
For each mutation provide the
base substitutions and their position in the sequence (left to right) where the
base substitutions occurred. For example,
C → G, 456 (C changed to G
at position 456)
G → A, 513 and T → A,
907 (G changed to A at position 513 and T changed to A at position 907)
A → G, 39 (A changed to G
at position 39).
The task of mini challenge 3.4 was
to identify the top three mutations that lead to the most dangerous viral
strains. Our concept relies on pairwise sequence comparison of selected patients.
The main point in this task is to isolate the base substitutions leading to the
worst value of each property (severe for symptoms, high for mortality, etc.).
There are two ways to generate
pairs. The first procedure searches the dataset for pairs in which only one
disease characteristic changes to its worst state, while the other
characteristics remain the same. This approach allows us to track individual
base substitutions. We call this procedure "exclusive-pairs". In a
preliminary analysis we noticed that the resulting pair set may be too small
for a reliable and exhaustive solution. The second way to
generate pairs is to use all pairs in which one disease characteristic changes to
its worst state, while ignoring the others. We call this approach "greedy-pairs".
This model led to a larger dataset including alternative answers.
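The difference between the two procedures can be sketched as a single pair
generator with a switch. The sketch assumes that each patient's disease
characteristics are encoded as an integer array and that the worst level of
each characteristic is known; the encoding and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Exclusive-pairs versus greedy-pairs generation (illustrative sketch). */
final class PairGenerationSketch {

    /**
     * Returns index pairs {from, to} in which characteristic c rises from a
     * non-worst level to its worst level. With exclusive == true, all other
     * characteristics must be identical; with exclusive == false they are ignored.
     */
    static List<int[]> pairs(int[][] traits, int[] worst, int c, boolean exclusive) {
        List<int[]> result = new ArrayList<>();
        for (int a = 0; a < traits.length; a++) {
            for (int b = 0; b < traits.length; b++) {
                boolean reachesWorst = traits[a][c] < worst[c] && traits[b][c] == worst[c];
                if (!reachesWorst) {
                    continue;
                }
                if (exclusive && !othersEqual(traits[a], traits[b], c)) {
                    continue;
                }
                result.add(new int[] {a, b});
            }
        }
        return result;
    }

    /** True if x and y agree on every characteristic except the one at index skip. */
    static boolean othersEqual(int[] x, int[] y, int skip) {
        for (int i = 0; i < x.length; i++) {
            if (i != skip && x[i] != y[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        int[][] traits = {{0, 0}, {2, 0}, {2, 1}}; // made-up patients, two characteristics
        int[] worst = {2, 1};
        System.out.println(pairs(traits, worst, 0, true).size());
        System.out.println(pairs(traits, worst, 0, false).size());
    }
}
```

By construction, every exclusive pair is also a greedy pair, which matches the
observation that the greedy procedure yields the larger pair set.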
Once we had determined the substituted bases
for each disease characteristic, we applied an intersection operation. The
bases common to all characteristics yield the desired mutations.
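The intersection step can be sketched with standard Java sets. The sketch
assumes one set of substitution labels per disease characteristic; the class
name and the labels in the example are hypothetical.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.Iterator;
import java.util.Set;
import java.util.TreeSet;

/** Intersection of per-characteristic substitution sets (illustrative sketch). */
final class IntersectionSketch {

    /** Returns the substitutions that occur in every one of the given sets. */
    static Set<String> commonSubstitutions(Collection<Set<String>> perCharacteristic) {
        Iterator<Set<String>> it = perCharacteristic.iterator();
        if (!it.hasNext()) {
            return new TreeSet<>();
        }
        Set<String> common = new TreeSet<>(it.next()); // copy, so the inputs stay untouched
        while (it.hasNext()) {
            common.retainAll(it.next());
        }
        return common;
    }

    public static void main(String[] args) {
        Set<String> symptoms = new TreeSet<>(Arrays.asList("A->C, 10", "G->T, 20"));
        Set<String> mortality = new TreeSet<>(Arrays.asList("G->T, 20", "C->A, 30"));
        System.out.println(commonSubstitutions(Arrays.asList(symptoms, mortality)));
    }
}
```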
We propose a ranking that orders
substitutions according to their frequency of occurrence. Our idea was to count
the occurrences of each base substitution and then calculate the frequency
relative to the number of patient pairs. Finally, we use the weighted
average of these frequencies across the disease characteristics as the
comparison factor.
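A ranking of this kind can be sketched as follows. The sketch assumes that each
substitution has one relative frequency per disease characteristic (0 if it
does not occur there) and that the weights are supplied by the caller; the text
does not state the weights, so equal weights are used in the example. All names
and values are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Ranks substitutions by the weighted average of their frequencies (illustrative sketch). */
final class FrequencyRankingSketch {

    static double weightedAverage(double[] freqs, double[] weights) {
        double numerator = 0;
        double denominator = 0;
        for (int i = 0; i < freqs.length; i++) {
            numerator += weights[i] * freqs[i];
            denominator += weights[i];
        }
        return numerator / denominator;
    }

    /** Returns the substitution labels ordered from highest to lowest weighted average. */
    static List<String> rank(Map<String, double[]> freqs, double[] weights) {
        List<String> keys = new ArrayList<>(freqs.keySet());
        keys.sort((x, y) -> Double.compare(
                weightedAverage(freqs.get(y), weights),
                weightedAverage(freqs.get(x), weights)));
        return keys;
    }

    public static void main(String[] args) {
        Map<String, double[]> freqs = new LinkedHashMap<>();
        freqs.put("A->C, 10", new double[] {0.2, 0.4}); // made-up frequencies
        freqs.put("G->T, 20", new double[] {0.9, 0.7});
        freqs.put("C->A, 30", new double[] {0.5, 0.5});
        System.out.println(rank(freqs, new double[] {1, 1})); // equal weights (assumption)
    }
}
```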
We extended our Java application to
handle all disease characteristics; the implementation took approximately 30
hours. The algorithm generates patient pairs using either the exclusive-pairs
or the greedy-pairs procedure. For symptoms, for example, pairs going from mild
to severe or from moderate to severe are relevant for the final
result, whereas pairs going from mild to moderate are discarded. The
program determines the base substitutions and records their relative frequency.
The union over those pairs yields the set of base substitutions for the
corresponding disease characteristic. This process is repeated for all disease
characteristics.
We then visualized our results using
a separate Java application, which employs the Processing graphics library [2]. This visualization plots the base substitutions for
each disease characteristic as well as the corresponding weighted average. We
only considered bases in which at least one mutation appears, so that the
visualized data is reduced to its relevant segments. Table 2 shows a typical
output from our Processing tool. A cell indicates an individual base
substitution; its label is located in the column header. On the left of each
line, labels indicate the corresponding disease characteristic. For example,
"To Severe Symptoms" means that this line shows mutations leading to
severe symptoms. The number in brackets shows how many pairs were generated to
extract this information. A white cell means that no base substitution occurred
at that position for the corresponding disease characteristic.
For each base substitution we show
the relative frequency with respect to the total number of pairs. This number
is shown in each cell. The color intensity of the cell scales proportionally
with it. This lets the viewer compare frequency values quickly.
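A proportional mapping from frequency to cell shade could look like the
following sketch. It assumes a linear mapping in which a frequency of 0 gives
white (255) and a frequency of 1 gives black (0); the exact mapping used in the
tool is not stated, and the class name is hypothetical.

```java
/** Linear frequency-to-grayscale mapping (illustrative sketch, mapping is an assumption). */
final class GrayscaleSketch {

    /** Maps a relative frequency in [0, 1] to a gray level: 255 (white) for 0, 0 (black) for 1. */
    static int grayLevel(double frequency) {
        double clamped = Math.max(0.0, Math.min(1.0, frequency)); // guard against out-of-range input
        return (int) Math.round(255 * (1.0 - clamped));
    }

    public static void main(String[] args) {
        System.out.println(grayLevel(0.0));
        System.out.println(grayLevel(0.5));
        System.out.println(grayLevel(1.0));
    }
}
```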
Table 2. Output from the
Processing tool: the grayscale color of each cell is an indicator of the
frequency of the base substitution. The bottom line provides a weighted summary
for each base by computing the average frequency of substitutions.
The bottom line shows the weighted
average of the frequency of each base change. Since we wanted to extract base
substitutions that worsen all disease characteristics, we only plot frequency
values into cells corresponding to a substitution that worsens all disease
characteristics. The application offers the possibility to sort each line by
frequency. Table 3 shows an example of this case. This way the viewer can
quickly see which mutations lead to the worst disease characteristics.
Table 3. Base
substitutions sorted according to their positions
Also, each disease characteristic
can be sorted according to the frequency of the base substitutions: the user
clicks on the label of a disease characteristic on the left side. Table 4
depicts the situation in which the user has sorted the mutations leading to High
Mortality, thus directly revealing the most influential
substitutions.
Table 4. Base substitutions
sorted according to High Mortality influence
Finally, the bottom line (Weighted
Average) can be sorted according to the overall frequency of the mutations.
Using our Processing visualization tool, we were able to quickly determine the
mutations leading to the worst virus.
Table 5 displays the summary of our
task for the exclusive-pairs model. The weighted average line shows no numbers
in the top 3 bases, which means that there is no single substitution that
affects all disease characteristics. Based on the weighted average of the
substitutions with the highest frequencies, we arrive at the following top
three mutations:
1. T→C, 842
2. A→T, 946
3. G→C, 161.
Table 5. Results using the
“exclusive-pairs” approach, displayed in the red rectangle. No substitution
appears in all disease characteristics when exclusive-pairs are used. The
bottom cells are empty since we only plot frequency values into cells that
affect all disease characteristics.
However, these results reflect only a subset of
the total number of pairs. Using greedy-pairs, our results change slightly. Now
there are substitutions that affect all disease characteristics. We consider
the property of affecting all characteristics to be more relevant than the
absolute weighted averages of the frequencies. Substitutions affecting all
characteristics may not occur with the highest absolute frequency, yet only they
indicate a mutation to a virus that alters all disease characteristics to their
worst value. Following this policy, our final result, as shown in Table 6, is:
1*. G→C, 161
2*. G→C, 22
3*. C→A, 79.
Table 6. End results using the
"greedy-pairs" generation approach. The red rectangle encompasses the
top 3 base substitutions with the highest occurrence frequency. The blue
rectangle, on the other hand, highlights the substitutions which affect all
disease characteristics.
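The policy of preferring substitutions that affect all disease characteristics
over those with the highest average frequency can be sketched as a filter
followed by a sort. The sketch assumes one relative frequency per
characteristic for each substitution, with 0 meaning "does not occur", and an
unweighted mean as the tie-breaking score; all names and values are
hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Keeps only substitutions present in every characteristic, then ranks them (illustrative sketch). */
final class CoverageFirstSketch {

    /** True if the substitution has a positive frequency in every characteristic. */
    static boolean affectsAll(double[] freqs) {
        for (double f : freqs) {
            if (f <= 0) {
                return false;
            }
        }
        return true;
    }

    static double mean(double[] freqs) {
        double sum = 0;
        for (double f : freqs) {
            sum += f;
        }
        return sum / freqs.length;
    }

    /** Returns up to limit substitutions affecting all characteristics, highest mean first. */
    static List<String> topAffectingAll(Map<String, double[]> freqs, int limit) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, double[]> e : freqs.entrySet()) {
            if (affectsAll(e.getValue())) {
                keys.add(e.getKey());
            }
        }
        keys.sort((x, y) -> Double.compare(mean(freqs.get(y)), mean(freqs.get(x))));
        return new ArrayList<>(keys.subList(0, Math.min(limit, keys.size())));
    }

    public static void main(String[] args) {
        Map<String, double[]> freqs = new LinkedHashMap<>();
        freqs.put("A->C, 10", new double[] {0.9, 0.0}); // frequent, but absent in one characteristic
        freqs.put("G->T, 20", new double[] {0.2, 0.3}); // made-up frequencies
        freqs.put("C->A, 30", new double[] {0.4, 0.5});
        System.out.println(topAffectingAll(freqs, 3));
    }
}
```

In the example, the most frequent substitution is dropped because it is missing
from one characteristic, which mirrors the difference between the red and blue
rectangles in Table 6.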
[1] LingPipe, text processing toolkit for Java. http://alias-i.com/lingpipe/index.html.
[2] Processing, an open source programming language for creating images,
animations and interactions. http://processing.org.